Text as data - A game changer

Bennett Kleinberg

30 March 2021

Today

The (mysterious) case of text data

  • what is text anyway?
  • and what is it to data science?
  • what does it add?

Unlocking the power of text data

Why text is special

  • text is everywhere
  • and everything is text
  • think of online forums, emails, messages, transcripts, …
  • the most used source in computational social science research

Example: inaugural speeches of presidents

US presidents

  • they all give inaugural speeches
  • how do they differ?
  • are there special words that characterise, say, Obama vs Trump?

Working with a “corpus”

  • “inaugural corpus” of 58 speeches
  • of all US presidents
  • from 1789 (George Washington) to 2017 (Donald Trump)
  • variables: year, president, firstname, party

A first glimpse

  • total corpus size: 148,770 tokens
  • average tokens/speech: 2,565
  • average sentences/speech: 86.47
  • shortest speech: George Washington (1793) with 147 tokens
  • longest speech: Harrison (1841) with 9,123 tokens

1793 Washington

“Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America. […]”

Over time?

Trend?

By party?

But!

  • so far, this only describes the texts (meta-variables)
  • what about the actual text?

Going beyond summary variables…

What do they actually say?

  • sentiment analysis
  • the idea of n-grams
    • “I am again called upon by the voice of my country”
    • Unigrams: individual tokens (“I”, “am”, “again”)
    • Bigrams: sequences of two tokens (“I_am”, “am_again”, …)
    • Trigrams: sequences of three tokens (“I_am_again”, “am_again_called”, …)
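The n-gram idea can be sketched in a few lines of Python. This is a minimal illustration, not the tooling behind the slides: the helper name `ngrams` is hypothetical, and plain whitespace splitting stands in for a proper tokeniser.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumption: splitting on whitespace is a good enough tokeniser for this phrase.
tokens = "I am again called upon by the voice of my country".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams[:3])  # ['I', 'am', 'again']
print(bigrams[:2])   # ['I_am', 'am_again']
print(trigrams[:2])  # ['I_am_again', 'am_again_called']
```

A sequence of k tokens yields k - n + 1 n-grams, so the 11-token phrase gives 10 bigrams and 9 trigrams.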

Sentiment in speeches

  • we use a sentiment lexicon
  • and map each word to its known sentiment (positive vs negative vs neutral)
  • then we calculate the \(ratio_i = \frac{\sum{positive_i}}{\sum{negative_i}}\)
  • that ratio is 1.00 if both occur equally often
  • higher than 1 = more positive words
  • smaller than 1 = more negative words
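A rough sketch of the ratio calculation follows. The tiny `LEXICON` dictionary and the function name `sentiment_ratio` are made up for illustration; a real analysis would use a published sentiment lexicon, and the slides do not say which one was used.

```python
# Hypothetical toy lexicon: word -> sentiment label. Unlisted words count as neutral.
LEXICON = {
    "honor": "positive",
    "confidence": "positive",
    "great": "positive",
    "fear": "negative",
    "war": "negative",
}


def sentiment_ratio(tokens, lexicon=LEXICON):
    """Number of positive tokens divided by number of negative tokens.

    Returns None when there are no negative tokens, because the ratio
    is undefined in that case.
    """
    labels = [lexicon.get(token.lower(), "neutral") for token in tokens]
    positive = labels.count("positive")
    negative = labels.count("negative")
    if negative == 0:
        return None
    return positive / negative


tokens = "great honor and confidence despite fear of war".split()
print(sentiment_ratio(tokens))  # 1.5 (three positive words, two negative words)
```

How to handle speeches without any negative words is a design choice; returning None here simply makes the undefined case explicit.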

Sentiment ratio

N-grams

Most used unigrams

##     people government         us        can       upon       must      great 
##        575        564        478        471        371        366        340 
##        may     states      shall 
##        338        333        314

Visually

Most used ngrams (1,2,3)

##     of_the     in_the     to_the     of_our     people government         us 
##       1766        812        720        619        575        564        478 
##        can    and_the       upon 
##        471        471        371

By party?

name ngrams
Democratic us, people, can, government, must, nation, new, world, shall, every
Republican people, government, can, us, must, upon, world, great, country, peace

Visually

Obama vs Trump?

name ngrams
Obama us, must, can, people, nation, new, time, every, america, now
Trump america, american, people, country, one, every, never, great, nation, new

Maths with text

  • each speech can be represented as a vector


    document  rights  justice  important  revolution  individual  system  many
53  1997      1       1        1          2           0           0       0
54  2001      1       3        2          0           0           0       5
55  2005      5       6        0          0           0           0       1
56  2009      1       0        0          1           1           1       3
57  2013      2       2        0          0           2           0       1
58  2017      0       1        0          0           0           1       5
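As a sketch of how such a table comes about, the snippet below turns a few toy documents into count vectors over a shared vocabulary. The documents and the function name `document_term_vectors` are invented for illustration, and whitespace splitting again stands in for real tokenisation.

```python
from collections import Counter


def document_term_vectors(docs):
    """Map each document to a vector of term counts over a shared vocabulary."""
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    # The vocabulary is the sorted set of all terms seen in any document.
    vocabulary = sorted(set().union(*counts.values()))
    vectors = {name: [c[term] for term in vocabulary] for name, c in counts.items()}
    return vocabulary, vectors


# Hypothetical toy documents, not taken from the inaugural corpus.
docs = {
    "doc_a": "justice and rights for many",
    "doc_b": "many many rights",
}

vocabulary, vectors = document_term_vectors(docs)
print(vocabulary)        # ['and', 'for', 'justice', 'many', 'rights']
print(vectors["doc_a"])  # [1, 1, 1, 1, 1]
print(vectors["doc_b"])  # [0, 0, 0, 2, 1]
```

Every document gets a vector of the same length, with zeros for terms it does not contain, which is what makes the vectors comparable.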

Distances between vectors

Suppose we have got two vectors:

  • \(\vec{v_1} = [1, 2]\)
  • \(\vec{v_2} = [4, 3]\)

Euclidean distance

Uses Pythagorean theorem.

For two 2-dimensional locations:

  • build a right triangle
  • use \(a^2 + b^2 = c^2\) to calculate the length of the hypotenuse \(c\)
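Worked through for \(\vec{v_1} = [1, 2]\) and \(\vec{v_2} = [4, 3]\): the two legs of the right triangle are the coordinate differences, 3 and 1, so the distance is

\[
d(\vec{v_1}, \vec{v_2}) = \sqrt{(4 - 1)^2 + (3 - 2)^2} = \sqrt{9 + 1} = \sqrt{10} \approx 3.16
\]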

Maths with text

  • Each document is a vector
  • So we can calculate vector similarities
  • e.g. multidimensional Euclidean distance
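The same calculation extends to any number of dimensions by summing the squared differences over all coordinates. The sketch below applies it to the 2009 and 2013 rows of the seven-column excerpt shown earlier. The function name is illustrative, and because only seven of the many vocabulary columns are used, the result does not reproduce distances computed on the full speech vectors.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Columns: rights, justice, important, revolution, individual, system, many
speech_2009 = [1, 0, 0, 1, 1, 1, 3]
speech_2013 = [2, 2, 0, 0, 2, 0, 1]

distance = euclidean_distance(speech_2009, speech_2013)
print(round(distance, 2))  # 3.46, i.e. the square root of 12
```

A smaller distance means the two documents use the counted terms in more similar amounts.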

Similarity in speeches

  • Who was more “repetitive” - Obama or G.W. Bush?
  • Who was Trump the most similar to?
  • Who was the “least” like Trump?

Obama vs Bush

  • G.W. Bush 2001 vs G.W. Bush 2005: \(d(2001, 2005) = 51.37\)
  • Obama 2009 vs Obama 2013: \(d(2009, 2013) = 45.72\)

Obama’s speeches were more similar to one another

Most similar to Trump?

Year Trump
2017 0.00000
2001 43.26662
1977 48.27007
1945 50.07994
1905 51.34199
1793 51.46844

The three closest are: G.W. Bush (R, 2001), Carter (D, 1977), F.D. Roosevelt (D, 1945)

“Opposite” of Trump?

Year Trump
1841 290.5013
1845 153.2057
1909 148.8456
1897 123.7901
1821 122.0819
1889 116.2196

These are: W.H. Harrison, Polk, Taft, McKinley, Monroe, B. Harrison

What’s more?

Text as data

  • text data are everywhere
  • but: text data are super challenging
  • “quantification challenge”
  • just a tiny glimpse
  • NLP at the forefront of recent AI developments

Why bother?

  • new paths for research
    • phenomena of the online world
    • online data as a proxy for offline behaviour
    • the offline-online-offline nexus
  • touches every business area
  • will become standard
  • best embedded in solid social/behavioral science framework

Thanks.